

#### **Microkernel Construction** I.12 – Review

Lecture Summer Term 2017 Wednesday 15:45-17:15 R 131, 50.34 (INFO)

#### Jens Kehne | Marius Hillenbrand Operating Systems Group, Department of Computer Science







#### Threading

- Thread state must be saved/restored on thread switch
- We need a Thread Control Block (TCB) per thread
   TCBs must be kernel objects
  - TCBs implement threads
- We often need to find
  - Any thread's TCB using its global ID
  - The currently executing thread's TCB (per processor)

At least partially. We have found some good reasons to implement parts of the TCB in user memory.

#### Thread Switch $A \rightarrow B$



Thread A is running in user mode

- Thread A experiences an end-of-time-slice or is preempted by a (device) interrupt
- We enter kernel mode
- The microkernel saves the status of thread A on A's TCB
- The microkernel loads the status of thread B from B's TCB
- We leave kernel mode
- Thread B is running in user mode







Operating Systems Group Department of Computer Science





























**10** 12.07.2017





#### Thread Switch with single kernel stack

















**15** 12.07.2017

Operating Systems Group Department of Computer Science





movl thread\_id, %eax movl %eax, %ebx







**17** 12.07.2017

Operating Systems Group Department of Computer Science



movl thread\_id, %eax movl %eax, %ebx andl mask\_version, %eax



Mask out lower bits



**18** 12.07.2017

Operating Systems Group Department of Computer Science









### 0-Mapping Trick Direct Addressing



- Allocate physical memory for TCBs on demand
  - Dependent on the max number of allocated TCBs
- Map all remaining TCBs to a 0-filled read-only page
  - Any access to unused threads will result in "invalid thread ID" (0)
  - Avoids additional check



 Virtual TCB array requires ≥ 256 MB virtual memory for 256k potential TCBs



- TCB pointer array requires 1 MB virtual memory for 256k potential threads
- Virtual TCB array requires ≥ 256 MB virtual memory for 256k potential TCBs

#### Physical TCB array (seL4)



- Problem: Virtual TCB lookups cause TLB misses
  - Virtual TCB lookup is on IPC path!
- Solution: Use physical memory instead
- + No TLB misses
- + Significantly faster overall (Nourai 2005)
- + Easy to verify
- Requires ≥ 256 MB of physical memory!
- MMU may not permit physical addressing
  - Can still emulate physical memory using huge pages + pinning





Thread state toggles frequently (per IPC)

- Delete/insert ready list is expensive
- Therefore: delete *lazily* from ready list





Thread state toggles frequently (per IPC)

- Delete/insert ready list is expensive
- Therefore: delete *lazily* from ready list





Thread state toggles frequently (per IPC)

- Delete/insert ready list is expensive
- Therefore: delete *lazily* from ready list





Thread state toggles frequently (per IPC)

- Delete/insert ready list is expensive
- Therefore: delete *lazily* from ready list





Thread state toggles frequently (per IPC)

- Delete/insert ready list is expensive
- Therefore: delete *lazily* from ready list
- Whenever reaching a non-ready thread
  - Delete it from list
  - Proceed with next









Ready list contains all threads except the currently running thread



Jens Kehne, Marius Hillenbrand – Microkernel Construction, SS 2017



Ready list contains all threads except the currently running thread





#### Ready list contains all threads except the currently running thread

■ ready ↔ waiting

Don't re-insert blocked thread





#### Ready list contains all threads except the currently running thread

■ ready ↔ waiting

Don't re-insert blocked thread





# **ADDRESS-SPACE LAYOUT**

Operating Systems Group Department of Computer Science

#### Address-Space Layout 32bits, Virtual TCBs







35

#### **Shared Region Synchronization**



| <ul> <li>We have</li> <li>Region shared among all address spaces</li> <li>Separate page table per address space</li> <li>Updates occur in dynamic region</li> <li>May lead to inconsistencies</li> </ul> |                                 |   |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------|---|
| We need                                                                                                                                                                                                  |                                 |   |
| Some form of synchronization within<br>dynamic region                                                                                                                                                    |                                 |   |
| Make sure valid virtual memory                                                                                                                                                                           |                                 |   |
| mappings are synchronized                                                                                                                                                                                |                                 |   |
|                                                                                                                                                                                                          |                                 |   |
| phys mem                                                                                                                                                                                                 | Dynamic Static<br>region region | ر |

Operating Systems Group Department of Computer Science

#### TCB Area Synchronization Basic Algorithm



Dedicate one table as master

#### Synchronize with master table on page faults

Page fault algorithm:

if (master entry valid) {
 copy entry from master
} else {
 create new entry in master
 copy entry from master



Operating Systems Group Department of Computer Science



#### IPC

Operating Systems Group Department of Computer Science IPC – API



#### Operations

- Send to
- Receive from
- Receive
- Call
- Send to & Receive any
- Send to & Receive from
- Send async

#### Message Types

- Registers
- Strings
- Map pages

#### **Message Construction**



- Messages are stored in registers (MR<sub>0</sub> ... MR<sub>63</sub>)
- First register (MR<sub>0</sub>) acts as message tag
- Subsequent registers contain
  - Untyped words (u), and
  - Typed words (t) (e.g., map item, string item)



#### **Message Construction**



- Typed items occupy one or more words
- Four currently defined items
  - Map item (2 words)
  - Grant item (2 words)
  - String item (2+ words)
  - Capability (2 words)
- Typed items can have arbitrary order



#### **Map and Grant Items**



Two words
 Send base
 Fpage
 Lower bits of send base indicates map or grant item







42

#### **String Items**



Up to 4 MB (per string) Compound strings supported Allows scatter-gather Incorporates cacheability hints string pointer  $MR_{i+1}$ Reduce cache pollution 0hhC MR string length 0 for long copy operations String Item "hh" indicates cacheability hints for the string E.g., only use L2 cache, or do not use cache at all



#### **Receiving Messages**

- Receiver buffers are specified in registers (BR<sub>0</sub> ... BR<sub>33</sub>)
- First BR (BR<sub>0</sub>) contains "Acceptor"
  - May specify receive window (if not nil-fpage)
  - May indicate presence of receive strings/buffers (if s-bit set)

Acceptor 000s BR<sub>0</sub>

#### **Receiving Messages**







46 12.07.2017

**Receiving mappings in Fiasco.OC** 

**Operating Systems Group** Department of Computer Science

#### **Timeouts**



#### snd timeout, rcv timeout, xfer timeout snd, xfer timeout rcv



#### (specified by the partner thread)

Operating Systems Group Department of Computer Science

#### **Timeout Issues**



- What timeout values are typical or necessary?
- How do we encode timeouts to minimize space needed to specify all four values?

 $m_{(10)}$ 

e(5)

#### Timeout values

- $\blacksquare \infty$  (infinite)
  - Client waiting for a (trusted) server
- 0 (zero)
  - Server responding to a client
  - Polling
  - Specific time
    - 🗖 🗖 1 us 610 h (log)

 $2^e m \mu s$ 

#### **Timeout Issues**





#### What are Virtual Registers?



- Physical registers, or
- Non-pageable memory
- UTCBs hold the memory backed registers
  - UTCBs are thread local
  - UTCB can not be paged
    - No page faults
    - Registers always accessible







#### **Implementation Goal**

Most frequent kernel op: short IPC
 Thousands of invocations per second
 Performance is critical

- Structure IPC for speed
- Structure entire kernel to support fast IPC
- What affects performance?
  - Cache line misses
  - TLB misses
  - Memory references
  - Pipe stalls and flushes
  - Instruction scheduling



#### **Fast Path**

# Optimize for common cases Write in assembler Non-critical paths written in C++ But still fast as possible

#### Avoid high-level language overhead

- Function call state preservation
- Incompatible code optimizations

## We want every cycle possible! At least sometimes ...



#### **Avoid memory accesses**

Memory references are slow

Avoid in IPC

- E.g., use lazy scheduling
- Avoid in common case
  - E.g., (xfer) timeouts

#### Microkernel should minimize artifacts

- Cache pollution
- TLB pollution
- Memory bus

53











| E | FLAGS |
|---|-------|
|   | EIP   |
|   |       |

| SS | CS |
|----|----|
| DS | ES |
| FS | GS |







| EFLAGS |  |
|--------|--|
| EIP    |  |
|        |  |

| SS | CS |
|----|----|
| DS | ES |
| FS | GS |







| EFLAGS |     |  |
|--------|-----|--|
|        | EIP |  |
|        |     |  |

| SS | CS |
|----|----|
| DS | ES |
| FS | GS |





| EFLAGS | ; |
|--------|---|
| EIP    |   |
|        |   |

| SS | CS |
|----|----|
| DS | ES |
| FS | GS |





#### **String IPC / memcpy**

- Why?
  - Trust
  - Granularity
  - Synchronous ("atomic") transfer



60

#### Copy In – Copy Out

Copy into kernel buffer



#### Copy In – Copy Out

Copy into kernel bufferSwitch spaces



#### Copy In – Copy Out

- Copy into kernel buffer
- Switch spaces
- Copy out of kernel buffer
- Costs for *n* words
- 2×2n r/w operations
- Example: 8 words / cache
  - 3×*n*/8 cache lines
  - 1×n/8 cache misses (small n)
  - 4×n/8 cache misses (large n)



NARABER STREET





Select dest area (2x4 MB)



Karlsruhe Institute of Technology

Select dest area (2x4 MB)

#### Map into source AS (kernel)





- Select dest area (2x4 MB)
- Map into source AS (kernel)
- Copy data



- Select dest area (2x4 MB)
- Map into source AS (kernel)
- Copy data
- Switch to dest space





- Copy 2 page directory entries (PDEs) from dest
  - Addresses in temporary mapping area are resolved using dest's page tables





69

### String IPC: Better than shared memory?

#### Trust?

- Grant items prevent unmapping
- Granularity?
  - Sender decides memory layout
- Synchronous ("atomic") transfer?
  - Additional short IPC for signaling
- Tunneled page faults, copy area multiplexing
- Violates minimality

#### No string IPC in 3<sup>rd</sup> gen L4!





### MAPPING

Operating Systems Group Department of Computer Science



#### Mechanisms

## We need tools to build address spaces Map Unmap

- We need security
  - Access permissions [rwx]
- We need resource control
  - Page fault messages [detect page use]







### Unmap







### Grant







77

12.07.2017



### **Mapping Regions**

**78** 12.07.2017





### **Mapping Regions: Flex Pages**

### Abstraction: flex page

- Contiguous regions of virtual address space
  - Sparse physical mappings possible
- Called fpage
- Abstracts architecture's page sizes

### Fpage semantics

- Inseparable object
- Aligned to its size
- Size is power of 2, min. 4096=2<sup>12</sup> byte



### **Fpage Encoding**



fpage( base, size= $2^s$  ) s ≥ 12 base mod  $2^s = 0$ 

#### Туре

L4\_FPAGE\_SPECIAL = 0, L4\_FPAGE\_MEMORY = 1, L4\_FPAGE\_IO = 2, L4\_FPAGE\_OBJ = 3, //capability

### Special cases

- Complete address space (base=0, s=1)
- Nothing: nilpage (0)

- 5, //Capability



### **Mapping Pages**





### **Mapping Database**





82



## **INTERRUPTS + EXCEPTIONS**

### **Event Handling**



- 1. Program executes happily
- 2. Event occurs
- 3. Activate event handler
  - Save current state
  - Switch to privileged mode
  - Execute event handler
- 4. Fix the problem / handle event
- 5. End of event handling
  - Restore state
  - Switch to previous mode
  - Continue interrupted program
- 6. Program executes happily again



### Page Fault IPC





### **Page Fault Receive Window**





**86** 12.07.2017

### **New Exception Handling Model**





### Synchronous vs. asynchronous interrupt IPC







### Synchronous vs. asynchronous interrupt IPC



















### SECURITY





### Authentication



- Unforgeable endpoint identifiers
  - Thread ID of sender returned by kernel
  - Capabilities generated by kernel
  - Thread identifiers can be mapped to
    - Tasks
    - Users
    - Groups
    - Machines
    - Domains
  - Authentication is outside the microkernel any policy can be implemented



### Authorization

- Servers implement objects; clients access objects via IPC
- Servers receive unforgeable client identities from the IPC mechanism
  - Servers can implement arbitrary access control policy
- No special mechanisms needed in the microkernel

### Is this really true???

### Capabilites





### **Communication Spaces with capabilites**







### **Capability properties**

- Capabilities contain
  - Pointer to a kernel object
  - Access rights
- Capabilities live in kernel space
  - Not directly accessible to user
  - Referenced by index in per-AS capability array
- Capabilities provide:
  - Fine-grained access control
  - Local naming (name = idx in capability array)
    - Index has no meaning in other ASes!

### System calls with capabilities





### System calls with capabilities





### System calls with capabilities









**103** 12.07.2017















### **Other Key Ideas**

- Avoid memory
  - No indirection (TCB area)
  - Lazy scheduling
- Make clever use of HW features
- Serialize recursive algorithms
  - "Recursive" unmap



### **General Hints**

- Study concepts, not details
- Give short but detailed answers
- If you don't know the answer, think aloud
- Local IPC and Small spaces will NOT be on the exam!
- And most importantly

# Don't panic!